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A PROBABILISTIC h METHOD FOR CLUSTERING HIGH DIMENSIONAL DATA 


TSVETAN ASAMOV AND ADI BEN-ISRAEL 


Abstract. In general, the clustering problem is NP-hard, and global optimality cannot be established for 
non-trivial instances. For high-dimensional data, distance-based methods for clustering or classification 
face an additional difficulty, the unreliability of distances in very high-dimensional spaces. We propose a 
probabilistic, distance-based, iterative method for clustering data in very high-dimensional space, using the 
£i -metric that is less sensitive to high dimensionality than the Euclidean distance. For K clusters in R n , 
the problem decomposes to K problems coupled by probabilities, and an iteration reduces to finding Kn 
weighted medians of points on a line. The complexity of the algorithm is linear in the dimension of the data 
space, and its performance was observed to improve significantly as the dimension increases. 


1. Introduction 

The emergence and growing applications of big data have underscored the need for efficient algorithms 
based on optimality principles, and scalable methods that can provide valuable insights at a reasonable 
computational cost. 

In particular, problems with high-dimensional data have arisen in several scientific and technical areas 
(such as genetics [T9], medical imaging [29] and spatial databases [21], etc.) These problems pose a special 
challenge because of the unreliability of distances in very high dimensions. In such problems it is often 
advantageous to use the ^i-metric which is less sensitive to the “curse of dimensionality” than the Euclidean 
distance. 

We propose a new probabilistic distance-based method for clustering data in very high-dimensional 
spaces. The method uses the ^-distance, and computes the cluster centers using weighted medians of 
the given data points. Our algorithm resembles well-known techniques such as fuzzy clustering [9] and 
It-means, and inverse distance interpolation |26]. 

The cluster membership probabilities are derived from necessary optimality conditions for an approx¬ 
imate problem, and decompose a clustering problem with K clusters in M n into Kn one-dimensional 
problems, which can be solved separately. The algorithm features a straightforward implementation and 
a polynomial running time, in particular, its complexity is linear in the dimension n. In numerical experi¬ 
ments it outperformed several commonly used methods, with better results for higher dimensions. 

While the cluster membership probabilities simplify our notation, and link our results to the theory of 
subjective probability, these probabilities are not needed by themselves, since they are given in terms of 
distances, that have to be computed at each iteration. 

1.1. Notation. We use the abbreviation 1 ,K := {1, 2,. .. , K} for the indicated index set. The com¬ 
ponent of a vector x, € M n is denoted x t [j]. The £ p - norm of a vector x = (x[j]) G is 

n 

Hp — (£ \m\K r 

3 = 1 
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and the associated ^-distance between two vectors x and y is d p (x, y) := ||x — y|| p , in particular, the 
Euclidean distance with p = 2, and the ^-distance, 

n 

di(x,y) = ||x - y||! = y |x[j] - y[j]|. (1) 

3 = 1 


1.2. The clustering problem. Given 

• a set X = {xj : i € 1 ,N} C M n of N points Xj in M n , 

• their weights W = {wi > 0 : % € 1,1V}, and 

• an integer 1 < K < N, 

partition X into K clusters {X/,. : k € 1 ,K}, dehned as disjoint sets where the points in each cluster are 
similar (in some sense), and points in different clusters are dissimilar. If by similar is meant close in some 
metric d(x, y), we have a metric (or distance based) clustering problem, in particular ^-clustering 
if the Id-distance is used, Euclidean clustering for the ^“distance, etc. 

1.3. Centers. In metric clustering each cluster has a representative point, or center, and distances to 
clusters are defined as the distances to their centers. The center c^ of cluster X/, : is a point c that minimizes 
the sum of weighted distances to all points of the cluster, 

Cfc := argmin { y w; d(x;, c)}. (2) 

Thus, the metric clustering problem can be formulated as follows: Given X, W and K as above, find centers 
{cfc : k € 1 ,K} C M n so as to minimize 

K 

min V' V' Wid(x.i, c k ), (L .K) 

Cl, ” ,C K ^' 

k —1 

where X^, is the cluster of points in X assigned to the center c/.. 


1.4. Location problems. Metric clustering problems often arise in location analysis, where X is the set 
of the locations of customers, W is the set of their weights (or demands), and it is required to locate 
K facilities {c^} to serve the customers optimally in the sense of total weighted-distances traveled. The 
problem (L .K) is then called a multi—facility location problem, or a location—allocation problem 
because it is required to locate the centers, and to assign or allocate the points to them. 

Problem (L .K) is trivial for K = N (every point is its own center) and reduces for K = 1 to the single 
facility location problem: find the location of a center c € M n so as to minimize the sum of weighted 
distances, 

N 

min \ Wid(-Xi, c). (L.l) 

c € M" 

For 1 < K < N, the problem (L .K) is NP-hard in general |23|, while the planar case can be solved 
polynomially in N, m- 


1.5. Probabilistic approximation. (L-K) can be approximated by a continuous problem 

K 


min y 

Cl,—,C K ■' 

k=\ 


Y. Wip k (x.i)d{xi, c k 
Xi eX 


(P.X) 


where rigid assignments x,; € X^ are replaced by probabilistic (soft) assignments, expressed by probabilities 
Pfc(xj) that a point x* belongs to the cluster X^. 
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For each point x, the cluster membership probabilities p k (x{) sum to 1, and are assumed to depend 
on the distances d(xi,c k ) as follows 


membership in a cluster is more likely the closer is its center 


(A) 


Given these probabilities, the problem (P .K) can be decomposed into K single facility location problems, 

min V' p k (xi)wi d(xi, c k ), (P.k) 

Cu ' 

x;€X 

for k € 1 ,K. The solutions c k of the K problems (P.k), are then used to calculate the new distances 
d(xj,Cfc) for all i € l,iV, k € 1 ,K, and from them, new probabilities {p k (xi)}, etc. 


1.6. The case for the 4 norm. In high dimensions, distances between points become unreliable [7J, and 
this in particular “makes a proximity query meaningless and unstable because there is poor discrimination 
between the nearest and furthest neighbor” [T] . For the Euclidean distance 

n n 

ek(x,y) = (Y |x[j] — y[j]| 2 ) 1/2 = (11 x 11 ^ 2 Y x b']yb'l + lly|||) 1/2 (3) 

j =i j =i 

between random points x, y € M n , the cross products x[j]y[j] in Q tend to cancel for very large n, and 
consequently, 

d 2 (x,y) » (||x||| + ||y|||) 1/2 . 

In particular, if x, y are random points on the unit sphere in M n then d- 2 (x, y) ~ y/2 for very large n. This 
“curse of high dimensionality” limits the applicability of distance based methods in high dimension. 

The 4-distance is less sensitive to high dimensionality, and has been shown to “provide the best dis¬ 
crimination in high-dimensional data spaces”, [1]. We use it throughout this paper. 

The plan of the paper. The £i-metric clustering problem is solved in § [2] for one center. A proba¬ 
bilistic approximation of (L .K) is discussed in § [3j the probabilities studied in §§0HHJ The centers of the 
approximate problem are computed in § [6l Our main result, Algorithm PCM(tb) of § [HI uses the power 
probabilities of § [71 and has running time that is linear in the dimension of the space, see Corollary [l] 
Theorem 1, a monotonicity property of Algorithm PCM(t'i), is proved in §[9j Section [TO] lists conclusions. 
Appendix A shows relations to previous work, and Appendix B reports some numerical results. 


2. The single facility location problem with the 4-norm 


For the 4-distance © the problem (L.l) becomes 


N N n 

min V w l d x {x l ,c), or min Y w i | x ibl “ c bl|, 
ceM n ceR n ' 1 1 

i =1 i =1 j =1 


(4) 


in the variable c € M", which can be solved separately for each component c[j], giving the n problems 

N 

ji. (5) 


min Y w i I x *b1 - cbll, j G 1, 

c[j]gk ^ 


i=l 


Definition 1. Let X = {xi, • • • , xjy} € R be an ordered set of points 


X\ < X2 < • • • < xn 

and let W = {rci, • • • ,wn} be a corresponding set of positive weights. A point x is a weighted median 
(or W median) of X if there exist a, (3 > 0 such that 

Y { Wi: Xi < + a = Yh { Wi: x i > x }+(3 


( 6 ) 
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where a + /3 is the weight of x if x € X, and a = /3 = 0 if a; 0 X. 

The weighted median always exists, but is not necessarily unique. 
Lemma 1. For X, W as above, define 


k 

E w i 


0 k : = 


and let k* be the smallest k with 8 k > 4. If 


then x k * is the unique weighted median, with 


i= 1 


N 

E 

i=l 


k € 1 ,N, 


0 k * > 4 


a = 


l 


+ Y Wk ~ Y Wk 5 P = ui k * - a. 


Wk* + Wk - 
\ k>k* k<k* J 

Otherwise, if 

Ok* 2 ’ 

then any point in the open interval ( x k *,x k «(i) is a weighted median with a = f3 = 0. 


(7) 

( 8 ) 
(9) 

( 10 ) 


Proof. The statement holds since the sequence (J7|) is increasing from 6\ = (uq/EfcLi w k) to On = 1. □ 

Note: In case CED, 

Y, {Wk ■ X k < X k *} = Y { Wk : Xk ~ 

we can take the median as the midpoint of x k * and x k *^i, in order to conform with the classical definition 
of the median (for an even number of points of equal weight) . 

Lemma 2. Given X and W as in Definition [lj the set of minimizers c of 


N 

X 

i =1 


Wj \Xi — c\ 


is the set of W -medians of X. 


Proof. The result is well known if all weights are 1. If the weights are integers, consider a point Xi with 
weight Wi as Wi coinciding points of weight 1 and the result follows. Same if the weights are rational. 
Finally, if the weights are real, consider their rational approximations. □ 


3. Probabilistic approximation of (L .K) 

We relax the assignment problem in (L.K) of § 11.21 bv using a continuous approximation as follows, 

K N 

min XX Wip k (xi)d(xi, c fc ) (P-77) 

k=l i= 1 

with two sets of variables, 
the centers {c^.}, and 

the cluster membership probabilities {p k (xi)}, 


Pfc(xj) := Prob{x,j € X fc }, i € 1,1V, k G 1 ,K, 


(11) 
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Because the probabilities (pfc(x,)} add to 1 for each i G 1,7V, the objective function of (P .K) is an upper 
bound on the optimal value of (L.A'), 

K N 

EE Wip k (y.i)d(yLi, c fc ) > min(L.A'), (12) 

k =1 2—1 

and therefore so is the optimal value of (P.AT), 

min (P.AT) > min (L.K). (13) 


4. Axioms for probabilistic distance clustering 


In this section, c4(x) stands for d k (x.,c k ), the distance of x to the center c k of the /c^-cluster, k G 1 ,K. 
To simplify notation, the point x is assumed to have weight w = 1. 

The cluster membership probabilities {p k (x.) : k G 1,K} of a point x depend only on the distances 
(<4(x) : k G 1 ,K}, 

p(x) = f(d(x)) (14) 

where p(x) G M A is the vector of probabilities (p^(x)), and d(x) is the vector of distances (c4(x)). Natural 
assumptions for the relation (|14l) include 

di(x) < dj(x) => p*(x) > Pj(x), for all i, j G 1,K (15a) 

f(Ad(x)) = f(d(x)), for any A > 0 (15b) 

Qp(x) = f(Qd(x)), for any permutation matrices Q (15c) 

Condition (|15al) states that membership in a cluster is more probable the closer it is, which is Assumption 
(A) of $ 11.51 The meaning of (115bl) is that the probabilities Pfc( x ) do not depend on the scale of measurement, 
i.e., f is homogeneous of degree 0. It follows that the probabilities p^(x) depend only on the ratios of the 
distances (c4(x) : k G 1,-K"}. 

The symmetry of f, expressed by (I15cl) . guarantees for each k G 1 ,K, that the probability pk{x) does 
not depend on the numbering of the other clusters. 

Assuming continuity of f it follows from (I15al) that 

di(x) = dj(x) => Pi(x) =pj(x), 

for any i,j G 1, K. 

For any nonempty subset S C 1 ,K, let 

Ps(x) = ^( x )> 

se<S 

the probability that x belongs to one of the clusters {C s : s G 5}, and let pfc(x|«S) denote the conditional 
probability that x belongs to the cluster C k , given that it belongs to one of the clusters {C s : s G 5}. 

Since the probabilities Pfc(x) depend only on the ratios of the distances {(4(x) : k G 1 ,K}, and these 
ratios are unchanged in subsets S of the index set 1,K, it follows that for all k G 1, K, 0 / 5 C 1 ,K, 


Pfc(x) = p fc (x|S)p 5 (x) (16) 

which is the choice axiom of Luce, [22, Axiom 1], and therefore, [30], 

^fc(x) 

E ^( x ) 

s£S 


Pfc(x|5) 


(17) 
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where v k (x) is a scale function, in particular, 


Pfc(x) = 


^fc(x) 


E_ v s (x) 

s&if< 


Assuming v k (x) 7 ^ 0 for all k, it follows that 

Pfc(x)u fc (x) _1 = 


1 


E_ «.(*)’ 

where the right hand side is a function of x, and does not depend on k. 

Property (I15al) implies that the function v k {-) is a monotone decreasing function of d k (x). 


(18) 


(19) 


5. Cluster membership probabilities as functions of distance 


Given K centers {c^}, and a point x with weight w and distances {d(x, c^) : k G 1,A"} from these 
centers, a simple choice for the function v k (x) in (1171) is 

VkM = 


for which (PSD givesQ, 


wd k (x) 


( 20 ) 

( 21 ) 


wpk(-x)d(x, c fc ) = D(x), k G 1 ,K, 
where the function D(x), called the joint distance function (JDF) at x, does not depend on k. 

For a given point x and given centers {c*.}, equations (1211) are optimality conditions for the extremum 
problem 


mm 


r K K _ 1 

{ p 2 k d(x,c k ) : ^2 Pk = 1, Pk > 0, k G 1,K \ 

{ k =1 k =1 J 


( 22 ) 


in the probabilities {p k := p k (x)}. The squares of probabilities in the objective of (1221) serve to smooth 
the underlying non-smooth problem, see the seminal paper by Teboulle m- Indeed, m follows by 
differentiating the Lagrangian 

L(p,A) = w Pfcd(x,c fc ) + A p k - l^j , (23) 

k =1 \fc=l / 

with respect to p k and equating the derivative to zero. 

Since probabilities add to one we get from (I2TT) . 


Pfc( x ) = 


n d(x,cj) 

_ 

E ])I d (x,C r. 

£=1 m^i 


k G 1 ,K, 


(24) 


and the JDF at x, 


D(x) = w ■ 


K 

II d(x,Cj) 

3 =1 


K 


(25) 


E II d(x,C m ) 

1= 1 

which is (up to a constant) the harmonic mean of the distances {d(x, c k ) : k G 1 ,K}, see also (IA-41) 
below. 


^There are other ways to model Assumption (A), e.g. [5], but the simple model (1211) works well enough for our purposes. 
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Note that the probabilities {p k (x.) : k G 1 ,K} are determined by the centers {c*. : k G 1 ,K} alone, while 
the function D(x.) depends also on the weight w. For example, in case K = 2, 

d(x,c 2 ) , , d(x,ci) 


PU X J = -T, -r—77-7, P2( x ) = i , ,, x, 

a(x, ex) + a(x, c 2 ) a(x, ci) + d(x, c 2 J 

p(x)=w 

a(x, ci) + d(x, c 2 ) 

6. Computation of centers 


(26a) 

(26b) 


We use the ^-distance 0) throughout. The objective function of (P .K) is a separable function of the 
cluster centers, 

K 

f(ci,...,c K ) := ^2 /fc(cfc), (27a) 

k =1 
N 

where f k ( c) := ^ Wjp k (x-i) di(xj, c), k G l,iO (27b) 


2—1 


The centers problem thus separates into K problems of type (j4j), 


AT n 

min n w iPk(*i) V' Ixibl - Cfcbll, A: € l,iF, 
c fc € M “ 

^ X —1 


(28) 


coupled by the probabilities {p k ( x i)}- Each of these problems separates into n problems of type (0 for 
the components c k [j], 

N 

min V' Wip k (xi) Ixjbl - c k \j] I, k G l,iF, j G l,n, (29) 

cfebleMS 

whose solution, by Lemma [21 is a weighted median of the points {x.j[j]} with weights {wi Pfc( x «)}- 

7. Power probabilities 


The cluster membership probabilities {pfc(x) : k G 1,A'} of a point x serve to relax the rigid assignment 
of x to any of the clusters, but eventually it may be necessary to produce such an assignment. One way 
to achieve this is to raise the membership probabilities p k (x) of (I24[) to a power v > 1, and normalize, 
obtaining the power probabilities 


£ p u M) 

3 =1 

which, by ([24ll . can also be expressed in terms of the distances d(x, c k ), 


(30) 


p!r } ( x ) : = 


FI d(x,Cj) u 

_ 

£ n d{x,c m y 

£=1 rriy^i 


k G 1 ,K. 


(31) 


As the exponent v increases the power probabilities (x) tend to hard assignments: If M is the index 
set of maximal probabilities, and M has elements, then, 


lim pM(x) = ( #?’ 

K [ 0, otherwise, 


(32) 
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and the limit is a hard assignment if ffM = 1, i.e. if the maximal probability is unique. 

Numerical experience suggests an increase of u at each iteration, see, e.g., (1331) below. 

8. Algorithm PCM(4): Probabilistic Clustering Method with 4 distances 

The problem (P .K) is solved iteratively, using the following updates in succession. 

Probabilities computation: Given K centers {c^}, the assignments probabilities {pj.!^(xj)} are cal¬ 
culated using (13T1) . The exponent u is updated at each iteration, say by a constant increment A > 0 , 

v := v + A (33) 

starting with an initial vq. 

Centers computation: Given the assignment probabilities {p[^(xj)}, the problem (P.K) separates 
into Kn problems of type (f29l) . 

N 

min Y Wip ( u\xi) jxibl - c k [j]\, k € 1,-fiT, j € l,n, (34) 

cfeb' eR , K 
1=1 

one for each component c k [j\ of each center c k , that are solved by Lemma [2j 
These results are presented in an algorithm form as follows. 

Algorithm PCM(4) : An algorithm for the l\ clustering problem 


Data: X = {x* : i € 1 ,N} data points, { Wi : i € 1 ,N} their weights, 

K the number of clusters, 
e > 0 (stopping criterion), 

v o > 1 (initial value of the exponent v), A > 0 (the increment in f|33(l .) 
Initialization: K arbitrary centers {c^ : k E 1,K}, v := uq. 

Iteration: 

Step 1 compute distances {di(x, c^) : k € 1,A"} for all x £ X 

Step 2 compute the assignments {’P ^( x ) : x € X, k € 1 ,K} (using (f3T]i ) 

Step 3 compute the new centers {cfc + : k € 1 ,K} (applying Lemma [2] to (j34]i ) 
K 

Step 4 if d\(ck+,Ck) < e stop 
k =i 

else v := v + A , return to step 1 


Corollary 1 . The running time of Algorithm PCM(4) is 

0{NK(K 2 + n)I), (35) 

where n is the dimension of the space, N the number of points, K the number of clusters, and I is the 
number of iterations. 

Proof. The number of operations in an iteration is calculated as follows: 

Step 1: 0(nNK), since computing the l\ distance between two n-dimensional vectors takes O(n) time, 
and there are N K distances between all points and all centers. 

Step 2: 0(NK s ), there are NK assignments, each taking 0(K 2 ). 
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Step 3: 0{nNK ), computing the weighted median of N points in M takes 0(N ) time, and Kn such 
medians are computed. 

Step 4: 0(nK), since there are K cluster centers of dimension n. 

The corollary is proved by combining the above results. □ 

Remark 1. 

(a) The result (|35l) shows that Algorithm PCM(^i) is linear in n, which in high-dimensional data is much 
greater than N and K. 

(b) The first few iterations of the algorithm come close to the final centers, and thereafter the iterations 
are slow, making the stopping rule in Step 4 ineffective. A better stopping rule is a bound on the number 
of iterations /, which can then be taken as a constant in (|35lh 

(c) Algorithm PCM(^i) can be modified to account for very unequal cluster sizes, as in [141]. This modifi¬ 
cation did not significantly improve the performance of the algorithm in our experiments. 

(d) The centers here are computed from scratch at each iteration using the current probabilities, unlike 
the Weiszfeld method pH] or its generalizations, m-m, where the centers are updated at each iteration. 

9. MONOTONICITY 

The centers computed iteratively by Algorithm PCM(^i) are confined to the convex hull of X, a compact 
set, and therefore a subsequence converges to an optimal solution of the approximate problem (P.iF), that 
in general is not an optimal solution of the original problem (L.K). 

The JDF of the data set X is defined as the sum of the JDF’s of its points, 

D(X) := Y D W- (36) 

xgX 

We prove next a monotonicity result for D(X). 

Theorem 1. The function D(X) decrease along any sequence of iterates of centers. 

Proof. The function -D(X) can be written as 

/ K \ 

D < x > : = X X Pfc(x) -D(x), since the probabilities add to 1, 

xex \fc=i ) 

K 

= Y^Z u; ( x )»(x) 2 di(x,c fc ), by (EH). (37) 

xGX k =1 

The proof is completed by noting that, for each x, the probabilities (pfc(x) : k € 1 ,K} are chosen as to 
minimize the function 

K 

Y ^(x)Pfc(x) 2 di(x, Cfc) (38) 

k =1 

for the given centers, see (1221) . and the centers {c^ : k € 1 ,K} minimize the function (l38j) for the given 
probabilities. □ 

Remark 2. The function D(X.) also decreases if the exponent v is increased in (|3Q[) . for then shorter 
distances are becoming more probable in (]37fl . 

10. Conclusions 

In summary, our approach has the following advantages. 

(1) In numerical experiments, see Appendix B, Algorithm PCM(^i) outperformed the fuzzy clustering 
£i-method, the it-means i\ method, and the generalized Weiszfeld method m 




10 


TSVETAN ASAMOV AND ADI BEN-ISRAEL 


(2) The solutions of (1221) are less sensitive to outliers than the solutions of (IA-51) . which uses squares 
of distances. 

(3) The probabilistic principle (IA-81) allows using other monotonic functions, in particular the expo¬ 
nential function <f>(d) = e d , that gives sharper results, and requires only that every distance d(x, c) 
be replaced by exp {d(x, c)}, |5l. 

(4) The JDF (l36|) of the data set, provides a guide to the “right” number of clusters for the given data, 

E3- 
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Appendix A: Relation to previous work 


Our work brings together ideas from four different areas: inverse distance weighted interpolation, fuzzy 
clustering, subjective probability, and optimality principles. 

1. Inverse distance weighted (or IDW) interpolation was introduced in 1965 by Donald Shepard, 
who published his results j26j in 1968. Shepard, then an undergraduate at Harvard, worked on the following 
problem: 

A function u : M n -A K is evaluated at K given points {x*, : k € 1 ,K} in M”, giving the values 
{uk : k € 1 ,K}, respectively. These values are the only information about the function. It is required to 
estimate u at any point x. 

Examples of such functions include rainfall in meteorology, and altitude in topography. The point x 
cannot be too far from the data points, and ideally lies in their convex hull. 

Shepard estimated the value u(x) as a convex combination of the given values Uk, 

K 

«( x ) = A fe ( x ) u k (A-l) 

k =l 


where the weights Afc(x) are inversely ptoportional to the distances d(x, x&) between x and x*,, say 

/ 1 \ 

K 

^( x ) = X! IV ' ' u k 


k=1 


E 1 


(A-2) 


\j=i / 
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giving the weights 


Afc(x) = 


FI d(x,xj) 

3+k _ 

K 

e n fi (x. x„ 

1= 1 m^£ 


(A-3) 


that are identical with the probabilities (1241) . if the data points are identified with the centers. IDW 
interpolation is used widely in spatial data analysis, geology, geography, ecology and related areas. 
Interpolating the K distances d(x, x*,), i.e. taking Uk = d(x, x^.) in (IA-21) . gives 


K 


K 

n 

3 =1 


K 


(A-4) 


E II d(x,x n 

(.= 1 rrv£l 


the harmonic mean of the distances {d(x, x^.) : k £ 1 ,K}, which is the JDF in (|25l) multiplied by a scalar. 

The harmonic mean pops up in several areas of spatial data analysis. In 1980 Dixon and Chapman [T2] 
posited that the home-range of a species is a contour of the harmonic mean of the areas it frequents, and 
this has since been confirmed for hundreds of species. The importance of the harmonic mean in clustering 
was established by Teboulle ED, Stanforth, Kolossov and Mirkin [25|, Zhang, Hsu, and Dayal mm, 
Ben Israel and Iyigun [5j and others. Arav [3] showed the harmonic mean of distances to satisfy a system 
of reasonable axioms for contour approximation of data. 

2. Fuzzy clustering introduced by J.C. Bezdek in 1973, [8], is a relaxation of the original problem, 
replacing the hard assignments of points to clusters by soft, or fuzzy, assignments of points simultaneously 
to all clusters, the strength of association of x* with the cluster is measured by wik € [0,1]. 

In the fuzzy c-means (FCM) method [9] the centers {c/j} are computed by 

N K 


nun 


where the weights xn- are 


updated aj§ 


EE 

i =1 k =1 


W ik \\ x i- c kh, 


Wife — 


K 

E 

3 =1 


Xj — Cfc h 


2/m— 1 


(A-5) 


(A-6) 


|Xi C j ||2 y 

and the centers are then calculated as convex combinations of the data points, 

/ \ 


C k = 


N 

E 

Z— 1 


W, 


ik 


N 

E w 

\ 3 =1 


jk 


x,'. fee l ,K. 


(A-7) 


2 The weights (IA^6l) are optimal for the problem (IXT1) if they are probabilities, i.e. if they are required to add to 1 for every 
point Xi. 














HIGH DIMENSIONAL CLUSTERING 


13 


The constant m > 1 (the “fuzzifier”) controls he fuzziness of the assignments, which become hard assign¬ 
ments in the limit asmjl. For m = 1, FCM is the classical AT-means method. If m = 2 then the weights 
Wik are inversely proportional to the square distance ||xi — Cfc |||, analogously to (12TT) . 

Fuzzy clustering is one of the best known, and most widely used, clustering methods. However, it may 
need some modification if the data in question is very high-dimensional, see, e.g. [20] . 

3. Subjective probability. There is some arbitrariness in the choice of the model and the fuzzifier m 
in (1A-5I) (lA-fij) . In contrast, the probabilities (f2H) can be justified axiomatically. Using ideas and classical 
results ([22], [30]) from subjective probability it is shown in Appendix B that the cluster membership 
probabilities Pfc(x), and distances (4(x), satisfy an inverse relationship, such as, 

PA:(x) </>(d(x, Cfe )) = /(x), k G 1 ,K, (A-8) 


where </>(•) is non-decreasing, and /(x) does not depend on k. In particular, the choice (p(d ) = d gives (12TT) . 
which works well in practice. 

4. Optimality principle. Equation (1A-8I) is a necessary optimality condition for the problem 

( K K \ 


min 


(A-9) 


.fc=1 


k=1 


that reduces to (l22j) for the choice 4>(d) = d. This shows the probabilities {p/c(x)} of ([24jl to be optimal, 
for the model chosen. 


Remark 3. Minimizing a function of squares of probabilities seems unnatural, so a physical analogy may 
help. Consider an electric circuit with K resistances {Rk} connected in parallel. A current I through 
the circuit splits into K currents, with current /*. through the resistance R These currents solve an 
optimization problem (the Kelvin principle) 

min 

h,-,i K 

that is analogous to (12211 . The optimality condition for (IA-101) is Ohm’s law, 

h Rk = constant 

a statement that potential is well defined, and an analog of (I21jl . The equivalent resistance of the circuit, 
i.e. the resistance R such that / 2 R is equal to the minimal value in (IA-101) . is then the JDF (1251) with Rj 
instead of d(x, c j) and w = 1 . 

Appendix B: Numerical Examples 

In the following examples we use synthetic data to be clustered into K = 2 clusters. The data consists 
of two randomly generated clusters, Xi with N\ points, and X 2 with N 2 points. 

The data points x = {x\, ■ ■ ■ ,x n ) G M n of cluster X^ are such that all of their components aq, 1 < i < n 
are generated by sampling from a distribution iq. with mean fik, k = 1,2. In cluster Xi we take /g = 1, 
and in cluster X 2 , fi2 = — 1 . 

We ran Algorithm PCM(^i), with the parameters vq = 1, A = 0.1, and compared its performance with 
that of the fuzzy clustering method [9] with the i\ norm, as well as the generalized Weiszfeld algorithm 


K 


K 


E ^ 


= I 


k=1 


k=1 


(A-10) 
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a 

Method 

n = 10 4 

n = 5 • 10 4 

n = 10° 

n = 5 • 10 s 

n = 10 b 

a = 8 

PCM (4) 

0.0 

0.0 

0.0 

0.0 

0.0 


FCM (4) 

27.1 

38.6 

24.4 

40.9 

40.1 


K -means (^ 1 ) 

28.9 

26.8 

12.7 

22.4 

22.9 


Gen. Weiszfeld 

48.5 

48.8 

48.0 

48.2 

47.9 

a = 16 

PCM (G) 

4.3 

0.0 

0.0 

4.7 

0.0 


FCM (4) 

41.0 

42.1 

44.5 

43.9 

39.5 


AWneans (^ 1 ) 

41.8 

35.2 

23.7 

23.5 

23.6 


Gen. Weiszfeld 

48.0 

47.0 

48.4 

48.6 

48.0 

a = 24 

PCM (G) 

42.6 

8.8 

0.8 

4.8 

0.0 


FCM (h) 

46.4 

45.9 

47.5 

39.5 

45.1 


A"-means (G) 

45.5 

42.6 

35.6 

28.0 

24.5 


Gen. Weiszfeld 

47.9 

47.8 

47.1 

48.0 

48.2 

a = 32 

PCM (C) 

46.0 

42.2 

13.4 

13.6 

0.0 


FCM (G) 

47.4 

46.0 

44.8 

46.0 

46.0 


A"-means (^ 1 ) 

46.4 

45.7 

40.3 

36.0 

30.7 


Gen. Weiszfeld 

48.2 

48.9 

48.5 

48.9 

47.8 


Table 1. Percentages of misclassified data in Example [T] 


of [T8] (that uses Euclidean distances), and the £i-K-Means algorithm j23j. For each method we used a 
stopping rule of at most 100 iterations (for Algorithm PCM(G) this replaces Step 4). For each experiment 
we record the average percentage of misclassification (a misclassification occurs when a point in Xi is 
declared to be in X 2 , or vice versa) from 10 independent problems. In examples I1I2I3I we choose the 
probability distributions to be = iV(/ifc,cr). 


Example 1. In this example the clusters are of equal size, N\ = N 2 = 100. Table [l] gives the percentages 
of misclassification under the five methods tested, for different values of a and dimension n. 

Example 2. We use Aq = 200 and N 2 = 100. Table [2] gives the percentages of misclassifications for 
different values of a and dimension n. 

Example 3. In this case N\ = 1000, N 2 = 10. The percentages of misclassification are included in Table[3j 


In addition to experiments with normal data, we also consider instances with uniform data in Examples 
[4]and[H In this case l 7 ), is a uniform distribution with mean /ifc and support length |supp(Afc)|. 

Example 4. We use N\ = 100, AT 2 = 100. The results are shown in Table [4] 

Example 5. In this instance N± = 200, N 2 = 100. The results are shown in Table [5j 
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a 

Method 

n = 10 4 

n = 5 • 10 4 

n = 10° 

n = 5 • 10 3 

n = 10 b 

a = 8 

PCM (G) 

0.0 

0.0 

0.0 

0.0 

0.0 


FCM (G) 

11.9 

19.1 

25.2 

30.1 

22.6 


K -means (G) 

20.8 

25.9 

18.4 

31.4 

13.6 


Gen. Weiszfeld 

37.8 

37.9 

37.2 

36.7 

36.4 

<7 = 16 

PCM (G) 

10.4 

0.0 

0.0 

0.0 

0.0 


FCM (G) 

37.7 

35.6 

35.0 

36.2 

39.4 


AMneans (It) 

35.8 

32.0 

23.6 

31.7 

14.1 


Gen. Weiszfeld 

38.0 

37.7 

35.8 

36.6 

37.8 

a = 24 

PCM (G) 

44.1 

5.9 

1.2 

0.0 

0.0 


FCM (4) 

41.3 

37.7 

38.9 

36.7 

34.6 


AWneans (G) 

40.3 

39.9 

32.7 

33.3 

15.5 


Gen. Weiszfeld 

36.8 

37.7 

36.7 

36.9 

37.2 

u = 32 

PCM (G) 

47.2 

38.7 

18.5 

0.0 

0.0 


FCM (G) 

42.3 

38.8 

37.0 

39.7 

38.9 


AWneans (G) 

41.5 

42.9 

37.2 

36.8 

22.6 


Gen. Weiszfeld 

36.7 

36.9 

36.0 

36.5 

37.4 


Table 2. Percentages of misclassified data in 

Example [2] 


a 

Method 

n = 10 3 

n = 5 • 10 3 

n = 10 4 

n = 5 • 10 4 

n = 10 b 

a = 0.4 

PCM (G) 

46.4 

41.1 

24.1 

5.1 

0.9 


FCM (G) 

13.4 

0.5 

0.0 

19.5 

32.0 


/T-means (G) 

37.4 

31.6 

27.1 

36.5 

32.6 


Gen. Weiszfeld 

35.4 

36.7 

32.6 

33.7 

38.7 

<7 = 0.8 

PCM (G) 

47.4 

31.4 

23.4 

5.4 

1.8 


FCM (G) 

29.3 

9.5 

13.0 

27.3 

37.2 


/T-means (G) 

37.5 

32.0 

27.1 

36.4 

32.6 


Gen. Weiszfeld 

30.3 

31.6 

25.9 

27.9 

34.5 

a = 1.2 

PCM (G) 

47.3 

33.9 

26.2 

7.7 

1.6 


FCM (G) 

36.4 

20.8 

23.2 

31.1 

23.9 


/T-means (G) 

38.4 

32. 2 

28.3 

36.4 

32.6 


Gen. Weiszfeld 

22.1 

23.8 

26.8 

21.6 

25.5 

a = 1.6 

PCM (G) 

47.8 

35.4 

27.9 

9.8 

3.6 


FCM (G) 

41.1 

27.8 

30.0 

27.6 

24.2 


AT-means (G) 

37.6 

32.3 

28.3 

36.4 

33.4 


Gen. Weiszfeld 

23.1 

23.2 

21.1 

25.4 

31.6 


Table 3. Percentages of misclassified data in Example [3] 


In all examples Algorithm PCM(G) was unsurpassed and was the clear winner in Examples [U El [4] and 
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|supp(F)| 

Method 

n = 10 4 

O 

T - 1 

kO 

II 

g 

n = lO 5 

n = 5 • 10° 

n = 10 b 

|supp(F)| =8 

PCM (4) 

0.0 

0.0 

0.0 

0.0 

0.0 


FCM (4) 

0.0 

0.1 

0.3 

2.7 

0.1 


A'-means (4) 

5.0 

5.0 

4.8 

0.0 

0.0 


Gen. Weiszfeld 

0.0 

0.0 

0.0 

0.0 

0.0 

|supp(A)| = 16 

PCM (4) 

0.0 

0.0 

0.0 

0.0 

0.0 


FCM (4) 

8.9 

29.1 

26.6 

21.9 

25.6 


A'-means (4) 

23.8 

25.9 

18.0 

23.4 

17.8 


Gen. Weiszfeld 

47.0 

49.2 

46.8 

46.2 

47.4 

|supp(A)| = 24 

PCM (4) 

0.0 

0.0 

0.0 

0.0 

0.0 


FCM (4) 

23.6 

39.1 

20.1 

27.8 

25.4 


A'-means (4) 

32.0 

27.2 

18.7 

23.2 

18.1 


Gen. Weiszfeld 

47.1 

47.4 

48.0 

47.4 

47.3 

|supp(A)| = 32 

PCM (4) 

0.3 

0.0 

0.0 

0.0 

0.0 


FCM (4) 

28.8 

39.9 

36.6 

42.6 

38.8 


A'-means (4) 

35.7 

27.5 

19.3 

23.4 

18.6 


Gen. Weiszfeld 

48.1 

48.0 

47.9 

47.8 

47.9 


Table 4. Percentages of misclassified data in Example [5] 


|supp(A)| 

Method 

n = 10 4 

O 

T - 1 

kO 

II 

g 

n = lO 5 

n = 5 • 10° 

3 

II 

I— 1 

o 

a 

|supp(A)| = 8 

PCM (4) 

0.0 

0.0 

0.0 

0.0 

0.0 


FCM (4) 

0.0 

10.0 

7.2 

0.4 

0.4 


A'-means (4) 

4.9 

13.5 

0.0 

4.9 

14.4 


Gen. Weiszfeld 

0.0 

0.0 

13.1 

0.0 

0.0 

supp(A)| = 16 

PCM (4) 

0.0 

0.0 

0.0 

0.0 

0.0 


FCM (4) 

30.8 

28.0 

28.3 

14.8 

18.3 


A'-means (4) 

22.2 

31.8 

18.4 

17.6 

32.0 


Gen. Weiszfeld 

39.2 

36.6 

35.7 

36.7 

36.3 

supp(A)| = 24 

PCM (4) 

0.0 

0.0 

0.0 

0.0 

0.0 


FCM (4) 

21.0 

26.1 

30.3 

27.5 

37.6 


A'-means (4) 

32.3 

36.3 

22.6 

18.1 

35.1 


Gen. Weiszfeld 

37.4 

38.4 

37.6 

36.5 

37.8 

supp(A)| = 32 

PCM (4) 

1.5 

0.0 

0.0 

0.0 

0.0 


FCM (4) 

38.0 

35.0 

36.5 

38.5 

33.5 


A'-means (4) 

35.1 

36.6 

23.1 

18.6 

35.4 


Gen. Weiszfeld 

39.7 

36.0 

37.5 

40.0 

38.0 


Table 5. Percentages of misclassified data in Example [5] 






























